Testing Provers on a Grid - Framework Description

Authors

  • Carlos Areces
  • Daniel Gorín
  • Alejandra Lorenzo
  • Mariano Pérez Rodríguez
Abstract

GridTest is a framework for testing automated theorem provers using randomly generated formulas. It can be used to run tests locally, on a single computer, or on a computer grid. It automatically generates a report as a PostScript file which, among other things, includes graphs for time comparison. We have found GridTest extremely useful for testing and comparing the performance of different automated theorem provers (for hybrid, modal, first-order and description logics). We present GridTest in this framework description in the hope that it might be useful for the general community working in automated deduction.

1 Testing Automated Provers

Testing a system is an invaluable source of information about its behaviour. But testing can be difficult and time consuming. This is particularly the case for systems that have to deal with very diverse input (where exhaustive testing would be impossible) and which require complex computation (that is, systems trying to solve problems that are known to be computationally hard). Automated theorem provers are prime examples of systems with these characteristics.

The question of how to perform suitable testing and comparison of automated theorem provers has been widely debated, and different proposals have been presented (see, e.g., [1–10]) for logics ranging from propositional to first-order. Which kind of testing is most adequate (whether the problems should be randomly generated, hand tailored, or taken from real-world applications; how the potential input space is covered; etc.) is difficult to evaluate. Probably the only safe thing to say is that the more you test, the better (and still, no testing might be better than bad testing, depending on the claims that are put forward on the basis of the tests performed). In this article we will not discuss this issue; we will focus on one particular kind of testing that is well suited to being distributed on a computer grid.

In particular, we present GridTest, a framework for testing automated theorem provers using randomly generated formulas. GridTest has been developed in Python [11] and can be used to run tests locally, on a single computer, or on a computer grid. It automatically generates a report as a PostScript file (produced via LaTeX). It can compile statistics provided by the provers (e.g., running time, number of applications of a particular rule, open/closed branches, etc.) and produce graphs generated using GnuPlot. Even if the prover does not provide any statistics, GridTest will use the time command (available on all POSIX-conformant operating systems) to obtain running times to plot in the final report. The framework is released under the GNU General Public License and is available at http://glyc.dc.uba.ar/intohylo/gridtest.php.

GridTest was originally designed to automate tests such as those described in [8]. That is, we use a random generator of formulas in conjunctive normal form to obtain batches of formulas with an increasing number of conjunctions. In this way, we can explore the average behaviour of the provers over a spectrum of formulas that runs from mostly satisfiable to mostly unsatisfiable, aiming to hit the point of maximum uncertainty (i.e., where a randomly generated formula is equally likely to be satisfiable or unsatisfiable). We have used GridTest mostly to test theorem provers for hybrid logics, and hence the current framework uses hGen as its random formula generator [12].
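To illustrate the phase-transition behaviour that this kind of test exploits, the following self-contained sketch (not part of GridTest; names such as random_clause are ours, and plain propositional formulas are used instead of hGen's hybrid language for simplicity) generates random CNF formulas with an increasing number of clauses and reports how many formulas in each batch are satisfiable; the fraction drops from close to 1 to close to 0 as clauses are added.

    import random

    def random_clause(num_props, clause_width=3):
        # A random disjunctive clause: a few distinct propositions, each with a random polarity.
        props = random.sample(range(num_props), k=min(clause_width, num_props))
        return [(p, random.choice([True, False])) for p in props]

    def random_cnf(num_clauses, num_props):
        # A conjunction of independently generated disjunctive clauses.
        return [random_clause(num_props) for _ in range(num_clauses)]

    def is_satisfiable(cnf, num_props):
        # Brute-force satisfiability check; acceptable for a toy illustration.
        for bits in range(2 ** num_props):
            valuation = [bool((bits >> i) & 1) for i in range(num_props)]
            if all(any(valuation[p] == polarity for p, polarity in clause) for clause in cnf):
                return True
        return False

    if __name__ == "__main__":
        random.seed(0)
        num_props, batch_size = 5, 100
        for num_clauses in range(5, 60, 5):
            sat = sum(is_satisfiable(random_cnf(num_clauses, num_props), num_props)
                      for _ in range(batch_size))
            print(f"{num_clauses:3d} clauses: {sat}/{batch_size} satisfiable")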
hGen (also distributed under the GNU General Public License and available at http://glyc.dc.uba.ar/intohylo/hgen.php) is a generator of random formulas in the hybrid language H(@, ↓, A, D, P), that is, the basic modal logic extended with nominals, the satisfiability operator @, the ↓ binder, the universal modality A, the difference modality D, and the past modality P (see [13]). But because we were interested in comparing provers for different logics (e.g., description and first-order logics), the framework has been designed to properly handle translations between the output format of the random generator and the input formats of the different provers. In particular, timing graphs discriminate between the time used for the translation and the time used by the prover. A number of translations from the output format of hGen to the input formats of different provers are provided with the source code (e.g., the TPTP format for first-order provers, the standard input format for description logic provers, etc.), together with drivers for different provers (e.g., E, SPASS, Bliksem, Vampire, Racer, FaCT++, HTab, HyLoRes, etc.).

In order to run on a single machine, GridTest requires a Python interpreter and some typical POSIX tools (bash, time, tar, etc.). Its output is a collection of GnuPlot and LaTeX scripts that are automatically compiled into a PostScript file reporting the results. These requirements are fairly typical and can be met on almost every platform. Unfortunately, unlike the POSIX standard for operating systems, there is as yet no standard batch scheduling mechanism for computer clusters or grids. GridTest currently supports only one backend, the OAR [14] batch scheduler, to distribute a test over a computer cluster. We designed GridTest to use only very basic services that most batch schedulers should provide, but porting it to other systems may still be the most difficult challenge when trying to use GridTest elsewhere.

We have found GridTest extremely useful for testing and comparing the performance of different automated theorem provers. Thus, we present this framework description, and we freely release the source code, in the hope that it might be useful for the general community working in automated deduction.

2 Testing on the Basis of Random Formulas

The testing methodology implemented in GridTest uses a customizable random formula generator. We currently use hGen [12], although it should be easy to add support for additional generators. Like other random formula generators (e.g., [1, 8]), hGen generates formulas in conjunctive normal form: each formula is a conjunction of disjunctive clauses. Since each disjunctive clause can be seen as an additional constraint on satisfying models, random formulas with a small number of clauses tend to be satisfiable, while a large enough number of random clauses will make the formula unsatisfiable. By generating formulas with an increasing number of disjunctive clauses, we can build tests that start with formulas that have a high chance of being satisfiable and progressively obtain formulas with a high chance of being unsatisfiable, passing through the point where these probabilities are roughly equal. Formulas at this point tend to be difficult for most provers, regardless of whether they are naturally biased towards satisfiable or unsatisfiable formulas.
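As a rough illustration of how translation time can be kept separate from prover time, here is a sketch under our own assumptions; it is not GridTest's actual code (which relies on the POSIX time command), and the command names translate_hgen_to_tptp and eprover are placeholders for whatever translator and prover are configured.

    import subprocess
    import time

    def timed_run(cmd, timeout_s):
        # Run an external command; return (wall-clock seconds, completed process or None on timeout).
        start = time.monotonic()
        try:
            proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return time.monotonic() - start, None
        return time.monotonic() - start, proc

    # Placeholder commands: a translator from hGen output to TPTP, then a first-order prover.
    translation_time, _ = timed_run(["translate_hgen_to_tptp", "formula.hgen", "formula.tptp"], 60)
    prover_time, result = timed_run(["eprover", "formula.tptp"], 10)
    print(f"translation: {translation_time:.2f}s, prover: {prover_time:.2f}s")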
Of course, the precise number of clauses needed to reach this point varies depending on other parameters, such as the number of proposition symbols, the number and kind of modalities, the maximum modal depth, etc. With a random formula generator like hGen, we can set up the following benchmark: i) generate random formulas φ1, ..., φn, where φi has exactly i conjuncts (each conjunct being a disjunctive clause) and the rest of the parameters are fixed; ii) run provers p1, ..., pk on each of the n random formulas, using a fixed time limit per formula; iii) collect data of interest about each run (execution time, answer, number of rules fired if available, etc.) and plot it for comparison. Of course, this experiment is not statistically relevant by itself, because each data point corresponds to a single randomly generated formula. However, by repeating it a sufficiently large number of times (or, equivalently, by using a sufficiently large batch of formulas for each data point) and applying a statistical estimator to the sampled data (e.g., average, median, etc.), statistically relevant results can be obtained.

One can use this testing methodology for different purposes. For example, by comparing the responses of every prover on a particular formula, we have discovered inconsistencies which, in turn, allowed us to find and correct implementation errors in the provers we develop. We have also used these tests to assess the effectiveness of optimizations; this was done by comparing alternative versions of the same prover and looking at running times, number of rules fired, clauses generated, etc. Finally, one can also use this methodology to evaluate the relative performance of different provers. Of course, given the artificial nature of the formulas used in the tests, one must be careful about the conclusions drawn.

This methodology is conceptually simple, but it has a clear drawback: even for tests of moderate size, if we generate non-trivial formulas and allow each prover to run for a reasonable amount of time, the total running time on a single computer can quickly become large. Tests with running times measured in days (or weeks) become common. This is especially true if some of the provers involved in the test tend to time out often. If we are interested in using this form of testing as part of the development process of a prover, rapid availability of the results is crucial.

Notice, though, that because of the nature of the tests (and especially if we are more interested in qualitative than in quantitative data), we are not obliged to run all the tests serially on the same computer. We can just as well obtain statistical relevance by distributing the tests over a computer cluster: each machine runs the complete test on batches of smaller size, and the data is pooled for statistical analysis when all the runs are completed. Concretely, instead of running a test with batches of size b on a single computer, we can run n tests on n different computers, each processing a batch of size b/n, obtaining a linear reduction in the time required. Although large computer clusters are not ubiquitous, the recent emergence of grid computing technologies is giving researchers access to a very large number of computing resources for bounded periods of time. In this scenario, it is not unreasonable to assume the simultaneous availability of such a number of computers.
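The benchmark in steps i)-iii) can be summarized in a simplified Python sketch. This is our own illustration, not GridTest's implementation; generate_formula and run_prover are placeholders a caller would have to supply.

    import statistics

    def run_benchmark(provers, clause_range, batch_size, timeout_s,
                      generate_formula, run_prover):
        # generate_formula(n) -> a random formula with n disjunctive clauses (placeholder).
        # run_prover(prover, formula, timeout_s) -> (answer, seconds),
        #   with answer one of 'sat', 'unsat', 'timeout' (placeholder).
        results = {prover: [] for prover in provers}
        for n in clause_range:
            batch = [generate_formula(n) for _ in range(batch_size)]
            for prover in provers:
                runs = [run_prover(prover, formula, timeout_s) for formula in batch]
                times = [seconds for _, seconds in runs]
                results[prover].append({
                    "clauses": n,
                    "median_time": statistics.median(times),
                    "sat": sum(answer == "sat" for answer, _ in runs),
                    "unsat": sum(answer == "unsat" for answer, _ in runs),
                    "timeout": sum(answer == "timeout" for answer, _ in runs),
                })
        return results

Splitting each batch of size b into n sub-batches of size b/n, running them on n machines and concatenating the per-formula results before computing the medians gives exactly the distributed variant described above.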
As we mentioned, even if the grid is composed of heterogeneous machines (different processors, clock speeds, cache memory sizes, etc.), the qualitative results (i.e., the relative performance of the provers under evaluation) would not be affected. A computer cluster therefore seems perfectly suited to run this kind of test.

3 Installing and Using GridTest

The directory structure of a GridTest installation looks like this:

    gridtest
      bin .............. binaries (provers, etc.) to be run locally
      sbin ............. binaries (provers, etc.) to be run on the grid
      configurations ... test configurations
      drivers .......... drivers for the different provers

A more detailed description of what each directory is supposed to contain is as follows:

bin. Binaries for all the provers, translators and the formula generator used in a test run on the local machine.

sbin. Binaries involved in a test on a grid. The architecture and/or operating system on the grid may differ from those on the local machine, and therefore it is better to keep these binaries in a separate directory. Observe that even if the architecture and operating system match, there can still be differences in the particular versions of the installed libraries; it is therefore convenient to use statically linked binaries for testing on the grid. The makeEnv.sh script (see below) takes the binaries from this directory.

configurations. The configuration of the test to run typically resides here. The installation comes with several examples that can be taken as a basis for new tests.

drivers. This directory contains Python driver files for the different provers. The installation comes with drivers for various theorem provers (e.g., Vampire, SPASS, E, Bliksem, Racer, FaCT++, HTab, HyLoRes, etc.). A new driver should be written for each additional prover to be used with the testing framework. A driver specifies how the prover should be run on a particular input file (for example, which command line parameters the prover takes) and how the answers from the prover should be interpreted. Typically the latter involves parsing the prover's output (saved to a file) and collecting the relevant data; a sketch of what this involves is shown below.
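To give a concrete idea of what a driver has to provide, here is a hedged sketch: the class name, method names and output patterns below are our own illustrations and do not reflect the actual driver API shipped in gridtest/drivers.

    import re
    import subprocess

    class ExampleDriver:
        # Hypothetical driver sketch: how to run a prover and interpret its answer.

        def __init__(self, prover_id, bin_dir, binary):
            self.prover_id = prover_id
            self.command = f"{bin_dir}/{binary}"

        def run(self, input_file, timeout_s):
            # Run the prover on input_file and classify its answer.
            try:
                proc = subprocess.run([self.command, input_file],
                                      capture_output=True, text=True, timeout=timeout_s)
            except subprocess.TimeoutExpired:
                return "timeout"
            return self.parse_answer(proc.stdout)

        def parse_answer(self, output):
            # The exact wording depends on the prover; these patterns are placeholders.
            if re.search(r"unsatisfiable", output, re.IGNORECASE):
                return "unsat"
            if re.search(r"satisfiable", output, re.IGNORECASE):
                return "sat"
            return "unknown"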
Suppose that we have the executables for the hybrid-logic provers HyLoRes [15, 16] and HTab [17] in the directory gridtest/bin. Suitable drivers for these two provers are already provided in gridtest/drivers as the files htab.py and hyloresv2_5.py. An example of a possible test configuration using these two provers is provided in gridtest/configurations as the file template.py (shown in Figure 1). With all this in place, we can launch a local test simply with the command:

    ./testRunner.py configurations/template.py

The result will be the creation of a directory called scratch-BENCHMARK_NAME.RANDOM_SUFFIX. The RANDOM_SUFFIX is added to avoid file name clashes when running the same configuration more than once. Before explaining the structure of this directory, let us have a look at the template.py file.

The first line in Figure 1 imports the class RandomHCNFTest. A configuration file is expected to define a function configureTest() that returns a properly configured instance of this class. Most of the parameters of the test can be set using the constructor. Many of them are used by hGen to generate formulas (e.g., the number and types of atoms that can appear, and their relative frequency; the kinds of modalities that can appear, their maximum nesting depth, and their relative frequency).

The parameters under the "Test structure" label define the general shape of the test: the number of formulas per batch, the range of numbers of clauses to generate (together with the step used to increase the number of clauses from batch to batch), and the time limit used for each run of a prover on one formula of the benchmark. The final lines in the configureTest() function respectively define the output directory name and set the provers to be run. The latter is done by passing a list of drivers to the setProvers() method. In this example, the list of drivers is built in the proverConfiguration() function: HTab and HyLoResV2_5 are the driver classes for the two provers, and their constructors expect an id for the prover, the directory where the binary is located (dirStructure.binDir, which defaults to the bin directory), and the name of the binary.

As we mentioned, the benchmark will generate a directory called scratch-Template.RANDOM_SUFFIX. The general structure of the resulting directory is as follows:

    scratch-BENCHMARK_NAME.RANDOM_SUFFIX
      tests ........... the batches of formulas generated for the test
        batch-1
        ...
        batch-n
      responses ....... the responses of the different provers
        prover1
          batch-1
          ...
          batch-n
        ...
        proverk
          batch-1
          ...
          batch-n
      results ......... the report and its intermediate files

In the tests directory we find all the input formulas generated for the test. These formulas are stored so that the test can be reproduced in the future by running the provers on the exact same benchmark. In the responses directory we find the responses of the different provers (captured by redirecting stdout to a file) when run on each input formula. In addition, for each run we also find there the running time obtained using the time command (both for the prover and for the translation, if one is needed). The information in the responses directory is used to create the report found in the results directory.

As discussed in Section 2, the report will contain at least a graph of the satisfiable/unsatisfiable/timeout curves for each of the provers tested, together with graphs of the real and user median times (with deviation) for all the provers (examples of these graphs are shown in Figures 3 and 4). In addition, the report will also show graphs corresponding to cases of unique responses (i.e., formulas that were solved by only one of the provers), and it will explicitly list formulas for which the provers have given contradictory responses (disregarding timeouts). These formulas are particularly interesting since they typically point to implementation errors in some of the provers. Other graphs can easily be generated (average number of clauses generated per batch, average number of open/closed branches, etc.), depending on the statistics collected by each prover driver.
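As an illustration of how such contradictory responses can be detected from the collected answers, the following is a simplified sketch under our own assumptions about the data layout (it is not GridTest's code): compare the answers of all provers per formula and flag any formula that received both a satisfiable and an unsatisfiable verdict, disregarding timeouts.

    def contradictory_formulas(responses):
        # responses: formula id -> {prover name: answer}, where answer is
        # assumed to be one of 'sat', 'unsat' or 'timeout'.
        flagged = []
        for formula_id, answers in responses.items():
            decided = {answer for answer in answers.values() if answer != "timeout"}
            if "sat" in decided and "unsat" in decided:
                flagged.append(formula_id)
        return flagged

    # Example: the two provers disagree on the first formula.
    example = {
        "batch-3/f07": {"htab": "sat", "hylores": "unsat"},
        "batch-3/f08": {"htab": "unsat", "hylores": "timeout"},
    }
    print(contradictory_formulas(example))  # ['batch-3/f07']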
Figure 1. The example test configuration template.py:

    from randomHCNFTest import RandomHCNFTest
    from htab import HTab
    from hyloresv2_5 import HyLoResV2_5

    def configureTest():
        test = RandomHCNFTest(testId = 'Template',
                              # Number of atoms
                              numOfProps = 2, freqOfProps = 2,
                              numOfNoms  = 2, freqOfNoms  = 2,
                              numOfSVars = 0, freqOfSVars = 0,
                              # Number and depth of modalities
                              numOfRels  = 2, maxDepth   = 2,
                              diamDepth  = 1, freqOfDiam = 1,
                              atDepth    = 1, freqOfAt   = 1,
                              downDepth  = 0, freqOfDown = 0,
                              invDepth   = 0, freqOfInv  = 0,
                              univDepth  = 1, freqOfUniv = 1,
                              diffDepth  = 0, freqOfDiff = 0,
                              # Test structure
                              batchSize = 5,
                              fromNumClauses = 10, toNumClauses = 20, step = 2,
                              timeout = 10)
        test.dirStructure.setNewScratchDirFor(test.testId)
        test.setProvers(proverConfiguration(test.dirStructure))
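The listing above breaks off before the end of the file as extracted here. Based on the description in the text, the configuration would be completed by returning the test object and by a proverConfiguration() function that builds the list of drivers. The following is only a reconstruction of that missing part; the exact argument values and identifiers are our own guesses.

        return test

    def proverConfiguration(dirStructure):
        # Each driver constructor takes an id for the prover, the directory where
        # its binary lives (dirStructure.binDir defaults to bin/) and the binary name.
        # Binary names here are illustrative.
        return [HTab('htab', dirStructure.binDir, 'htab'),
                HyLoResV2_5('hylores', dirStructure.binDir, 'hylores')]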

